Link here for the murder data set
When you read about or watch coverage of crime incidents, do you ever think about the statistics behind them: how much more often men kill than women, why killers choose a particular weapon, or how often a perpetrator and a victim know each other? The data we decided to work with is [US homicide reports 1980-2014]. This dataset includes murders from the FBI’s Supplementary Homicide Report from 1976 to the present and Freedom of Information Act data. It covers both solved and unsolved cases, and it also includes homicides that were not reported to the Justice Department.
We found this dataset to be the most interesting and most current one available. It contains the victim’s age, the perpetrator’s age, location, state, year, and weapon used, which aligns with our interest in what factors correlate with the behaviors and motives of killers. Our approach is to analyze and filter the dataset in order to answer the following questions: at what rate have homicides changed since 1980, what factors contribute to the year with the most kills, and what motives and behaviors contribute to a killer attacking their victims? Our analysis visualizes the data and identifies correlations within these murders over the years 1980-2014.
Domain Question
What factors are related to motives and behaviors of the killers?
Other questions
What weapon was used the most?
What state had the most kills?
Do the victims know their perpetrator?
How does gender play a role in homicide incidents?
library(tibble) # used to create tibbles
library(tidyr) # used to tidy up data
library(rmarkdown) # dynamic document
library(ggplot2) # used for data visualization
library(dplyr) # used for data manipulation
library(shiny) # used for showing dynamic visuals in collaboration with ggvis
library(prettydoc) # used for creating pretty documents from R Markdown
library(knitr) # used for dynamic report generation
library(tidyverse) # loads the core tidyverse packages in one call
library(hms) # time-of-day values; loaded ahead of kableExtra
library(kableExtra) # used to construct complex tables for data
library(tigris) # used to make the states map
# added libraries for other graphs
library(plotly)
library(rjson)
library(leaflet)
library(leaflet.providers)
library(maps)
library(viridis)
library(viridisLite)
library(sp)
library(quantmod)
library(plot3D)
library(sf)
library(RColorBrewer)
library(gganimate)
The original dataset is “Homicide Report” from Kaggle. First, we downloaded the data from Kaggle, which came as a zip file. We then created an R Markdown file and saved it into a new folder. We extracted the zip into that folder to obtain the CSV (comma-separated values) file called database.csv, which holds the data.
unzip(zipfile="./homicide.zip")
data <- read.csv("database.csv")
glimpse(data)
## Rows: 638,454
## Columns: 24
## $ Record.ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ Agency.Code <chr> "AK00101", "AK00101", "AK00101", "AK00101", "AK0…
## $ Agency.Name <chr> "Anchorage", "Anchorage", "Anchorage", "Anchorag…
## $ Agency.Type <chr> "Municipal Police", "Municipal Police", "Municip…
## $ City <chr> "Anchorage", "Anchorage", "Anchorage", "Anchorag…
## $ State <chr> "Alaska", "Alaska", "Alaska", "Alaska", "Alaska"…
## $ Year <int> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, …
## $ Month <chr> "January", "March", "March", "April", "April", "…
## $ Incident <int> 1, 1, 2, 1, 2, 1, 2, 1, 2, 3, 1, 2, 3, 1, 2, 3, …
## $ Crime.Type <chr> "Murder or Manslaughter", "Murder or Manslaughte…
## $ Crime.Solved <chr> "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "…
## $ Victim.Sex <chr> "Male", "Male", "Female", "Male", "Female", "Mal…
## $ Victim.Age <int> 14, 43, 30, 43, 30, 30, 42, 99, 32, 38, 36, 20, …
## $ Victim.Race <chr> "Native American/Alaska Native", "White", "Nativ…
## $ Victim.Ethnicity <chr> "Unknown", "Unknown", "Unknown", "Unknown", "Unk…
## $ Perpetrator.Sex <chr> "Male", "Male", "Unknown", "Male", "Unknown", "M…
## $ Perpetrator.Age <int> 15, 42, 0, 42, 0, 36, 27, 35, 0, 40, 0, 49, 39, …
## $ Perpetrator.Race <chr> "Native American/Alaska Native", "White", "Unkno…
## $ Perpetrator.Ethnicity <chr> "Unknown", "Unknown", "Unknown", "Unknown", "Unk…
## $ Relationship <chr> "Acquaintance", "Acquaintance", "Unknown", "Acqu…
## $ Weapon <chr> "Blunt Object", "Strangulation", "Unknown", "Str…
## $ Victim.Count <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Perpetrator.Count <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, …
## $ Record.Source <chr> "FBI", "FBI", "FBI", "FBI", "FBI", "FBI", "FBI",…
Agency Type: Law enforcement Agency who handled the case
State/City: State and Counties of the reported homicides
Year/Month: Time stamp of the homicides
Crime Type: Murder, Manslaughter or Negligence designated to case
Crime Solved: Whether the case has been solved or not
Victim Sex/Age/Race: Victim profile
Perpetrator Sex/Age/Race: Perpetrator profile
Relationship: The perpetrator’s relation to the victim
Weapon: Weapon used to commit homicide
Case Open/Closed: Change the designation of a crime being solved.
Solve Rate: Percentage of Homicide Reports where the case was closed
Top murder cases by state: we use our unfiltered data to compare which states have the most murders. To do this, we group the data by state, total the murders per state, and arrange the states by murder count in descending order.
data %>% group_by(State) %>%
summarize(Murder_Count = n()) %>%
arrange(desc(Murder_Count)) %>%
kbl() %>% kable_paper() %>% scroll_box(height = "300px")
| State | Murder_Count |
|---|---|
| California | 99783 |
| Texas | 62095 |
| New York | 49268 |
| Florida | 37164 |
| Michigan | 28448 |
| Illinois | 25871 |
| Pennsylvania | 24236 |
| Georgia | 21088 |
| North Carolina | 20390 |
| Louisiana | 19629 |
| Ohio | 19158 |
| Maryland | 17312 |
| Virginia | 15520 |
| Tennessee | 14930 |
| Missouri | 14832 |
| New Jersey | 14132 |
| Arizona | 12871 |
| South Carolina | 11698 |
| Indiana | 11463 |
| Alabama | 11376 |
| Oklahoma | 8809 |
| Washington | 7815 |
| District of Columbia | 7115 |
| Arkansas | 6947 |
| Colorado | 6593 |
| Kentucky | 6554 |
| Mississippi | 6546 |
| Wisconsin | 6191 |
| Massachusetts | 6036 |
| Nevada | 5553 |
| Connecticut | 4896 |
| New Mexico | 4272 |
| Oregon | 4217 |
| Minnesota | 3975 |
| Kansas | 3085 |
| West Virginia | 3061 |
| Utah | 2033 |
| Iowa | 1749 |
| Alaska | 1617 |
| Hawaii | 1338 |
| Nebraska | 1331 |
| Rhodes Island | 1211 |
| Delaware | 1179 |
| Idaho | 1150 |
| Maine | 869 |
| New Hampshire | 655 |
| Wyoming | 630 |
| Montana | 601 |
| South Dakota | 442 |
| Vermont | 412 |
| North Dakota | 308 |
After grouping the records by state and counting the murders, we plotted the counts in an interactive histogram. It helped us see how many more murders there were in California than in other states. Texas and New York also spiked above the other states.
Both the table and the graph show that California is the state with the most murder cases, with Texas coming in second and New York third.
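A chart like the one described could be drawn roughly as follows. This is a minimal sketch, not the code behind the original figure: `state_counts` is a hypothetical data frame holding five rows copied from the table above, and a static bar chart stands in for the interactive version.

```r
library(ggplot2)

# Hypothetical stand-in: five rows copied from the state table above.
state_counts <- data.frame(
  State = c("California", "Texas", "New York", "Florida", "Michigan"),
  Murder_Count = c(99783, 62095, 49268, 37164, 28448)
)

# reorder() sorts the bars by count so the largest state comes first.
state_plot <- ggplot(state_counts,
                     aes(x = reorder(State, -Murder_Count), y = Murder_Count)) +
  geom_col(fill = "steelblue") +
  labs(title = "Murder cases by state", x = "State", y = "Cases")

# One option for interactivity is to pass the plot to plotly::ggplotly(state_plot).
```

Sorting the bars by count rather than alphabetically makes the gap between California and the rest easier to see.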
## # A tibble: 10 × 2
## State Murder_Count
## <chr> <int>
## 1 California 99783
## 2 Texas 62095
## 3 New York 49268
## 4 Florida 37164
## 5 Michigan 28448
## 6 Illinois 25871
## 7 Pennsylvania 24236
## 8 Georgia 21088
## 9 North Carolina 20390
## 10 Louisiana 19629
We then wanted to focus on the ten states with the most murder cases to bring out the differences. In another interactive histogram, you can see California and Texas having the most murder cases. They are two of the largest states in the US, which helps explain why they would have the most kills. New York is not as massive a state as TX or CA, but it still ranked third. Our assumption is that since NY is home to New York City, the city that never sleeps and one of the biggest and most visited cities in the US, there would be more crime there than in smaller states. Seeing the histogram made us want to know what weapons are being used in these murder cases.
We count how many times killers used each kind of weapon to see their top weapon choices.
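The top-ten table can be produced by ranking the per-state counts and keeping the largest ones. The sketch below assumes a tiny hypothetical data frame, `toy`, with one row per homicide record; it is not the original chunk, and it keeps the top 2 only because the toy data has three states.

```r
library(dplyr)

# Hypothetical stand-in for the raw data: one row per homicide record.
toy <- data.frame(
  State = c("California", "California", "California",
            "Texas", "Texas", "New York")
)

top_states <- toy %>%
  count(State, name = "Murder_Count") %>%  # same result as group_by() + summarize(n())
  slice_max(Murder_Count, n = 2)           # use n = 10 on the full data

top_states
```

`slice_max()` returns the rows already ordered from the largest count down, so no separate `arrange()` call is needed.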
data %>% group_by(Weapon) %>%
summarize(Most_Weapon_Used = n()) %>%
arrange(desc(Most_Weapon_Used)) %>%
kbl() %>% kable_paper() %>% scroll_box(height = "300px")
| Weapon | Most_Weapon_Used |
|---|---|
| Handgun | 317484 |
| Knife | 94962 |
| Blunt Object | 67337 |
| Firearm | 46980 |
| Unknown | 33192 |
| Shotgun | 30722 |
| Rifle | 23347 |
| Strangulation | 8110 |
| Fire | 6173 |
| Suffocation | 3968 |
| Gun | 2206 |
| Drugs | 1588 |
| Drowning | 1204 |
| Explosives | 537 |
| Poison | 454 |
| Fall | 190 |
Here you can see that the handgun is by far the “favorite” weapon of killers compared to other weapons. We also have a good number of “Unknown” weapons, so we will need to clean our data to get more accurate results.
Here we wanted to visualize the highest crime counts in the US. Heatmaps are great for focusing on the locations that matter the most. In this case, we see CA in red compared to other states. The heatmap also shows that the northern US has fewer murder cases than other regions.
Now let’s focus on the best state in the US, Texas. Unfortunately, Texas comes in second for the highest number of murder cases. We wanted to see which county/city had the most. Kaufman County had the highest crime rate in Texas.
We wanted to include California, since it has the highest count among all states, to see where most of the murders are. California is broken down into cities instead of counties.
In California, most of the murder cases came from the city of Los Angeles and Southern California rather than Northern California.
We counted the number of cases by gender for each state. We combined the total female count by state with the total male count into one dataset to compare them.
We made two heatmaps, one showing male murder cases and the other showing female murder cases in the US. We can see some similarities between the two: both have a lot of murder cases in California, Texas, and New York.
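One way to build a single per-state table with female and male totals side by side is to count by state and sex, then spread the sexes into columns. This is a sketch under assumptions: `toy` is a hypothetical data frame, and using `Victim.Sex` (rather than the perpetrator's sex) is a guess about which column the heatmaps were based on.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the raw data.
toy <- data.frame(
  State = c("Texas", "Texas", "Texas", "Alaska", "Alaska"),
  Victim.Sex = c("Male", "Female", "Male", "Male", "Unknown")
)

gender_by_state <- toy %>%
  filter(Victim.Sex %in% c("Female", "Male")) %>%  # drop unknowns
  count(State, Victim.Sex) %>%                     # cases per state and sex
  pivot_wider(names_from = Victim.Sex,             # one column per sex
              values_from = n,
              values_fill = 0)                     # states with no cases get 0

gender_by_state
```

The wide layout gives each heatmap its own column to read from while keeping both counts on the same row for comparison.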
# As an example, let's see how unknown values show up in the victim sex field
data %>% group_by(Victim.Sex) %>% summarize(Gender = n())
## # A tibble: 3 × 2
## Victim.Sex Gender
## <chr> <int>
## 1 Female 143345
## 2 Male 494125
## 3 Unknown 984
As you can see in this table, 984 records have an unknown victim sex, so we need to tidy up our data and get rid of the unknowns.
# How about unknown weapons?
data %>% group_by(Weapon) %>%
summarize(Most_Weapon_Used = n()) %>%
arrange(desc(Most_Weapon_Used)) %>%
kable() %>% kable_paper() %>% scroll_box(height = "300px")
| Weapon | Most_Weapon_Used |
|---|---|
| Handgun | 317484 |
| Knife | 94962 |
| Blunt Object | 67337 |
| Firearm | 46980 |
| Unknown | 33192 |
| Shotgun | 30722 |
| Rifle | 23347 |
| Strangulation | 8110 |
| Fire | 6173 |
| Suffocation | 3968 |
| Gun | 2206 |
| Drugs | 1588 |
| Drowning | 1204 |
| Explosives | 537 |
| Poison | 454 |
| Fall | 190 |
You can see that the handgun was used more than three times as often as the knife, the second most used weapon. It is interesting to know that the handgun was the most used weapon; it makes us wonder whether stricter gun regulation in the US would make the handgun less common and make the knife the most used weapon. The 33,192 unknown weapons and the several overlapping gun categories also need to be sorted out.
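Sorting out the gun categories could look something like the sketch below. It assumes the five gun-related labels in the table should be merged into one group, which is our illustration rather than a step taken in the original analysis; `weapon_counts`, `gun_labels`, and the "Any firearm" label are hypothetical names, and the counts are copied from the table above.

```r
library(dplyr)

# Counts copied from the weapon table above (subset of rows).
weapon_counts <- data.frame(
  Weapon = c("Handgun", "Knife", "Firearm", "Unknown", "Shotgun", "Rifle", "Gun"),
  Most_Weapon_Used = c(317484, 94962, 46980, 33192, 30722, 23347, 2206)
)

# Assumption: these five labels all describe firearms.
gun_labels <- c("Handgun", "Firearm", "Shotgun", "Rifle", "Gun")

weapon_groups <- weapon_counts %>%
  filter(Weapon != "Unknown") %>%                            # drop unknown weapons
  mutate(Weapon_Group = if_else(Weapon %in% gun_labels,
                                "Any firearm", Weapon)) %>%  # merge gun labels
  group_by(Weapon_Group) %>%
  summarize(Cases = sum(Most_Weapon_Used)) %>%
  arrange(desc(Cases))

weapon_groups
```

Merged this way, firearms of all kinds account for 420,739 cases in the unfiltered table.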
# Graph for cases by age
data %>% ggplot(aes(Victim.Age)) + geom_histogram(binwidth = 50) +
labs(title = "How many cases over victims' ages?",
x = "Age of Victim (years old)", y = "Cases")
From this plot, we see that there are many cases with victims listed as nearly 1,000 years old. That doesn’t make sense, so we proceeded to filter our data to make it neater and more coherent.
## Rows: 346,656
## Columns: 5
## $ Year <int> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 198…
## $ Victim.Age <int> 14, 43, 43, 30, 42, 99, 20, 36, 31, 16, 33, 27, 33, 31, 2…
## $ Victim.Sex <chr> "Male", "Male", "Male", "Male", "Female", "Female", "Male…
## $ Relationship <chr> "Acquaintance", "Acquaintance", "Acquaintance", "Acquaint…
## $ Weapon <chr> "Blunt Object", "Strangulation", "Strangulation", "Rifle"…
In our filtered data, we kept the columns we would find useful for our findings and removed all the ‘unknowns’. The filtered data contains Year, Victim’s Age, Victim’s Sex, Relationship to the perpetrator, and the type of Weapon used in each incident. We also restricted the victim’s age to the range 1 to 100 to make it more accurate.
We wanted to observe the number of cases over the three and a half decades.
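The filtering step described here could be written along these lines. This is a sketch on a hypothetical four-row data frame, `toy`; the exact conditions used in the original analysis are not shown in the report, so dropping rows where sex, relationship, or weapon equals "Unknown" is our reading of the description.

```r
library(dplyr)

# Hypothetical stand-in for the raw data, including the kinds of rows to drop.
toy <- data.frame(
  Year = c(1980, 1980, 1981, 1981),
  Victim.Age = c(14, 998, 30, 0),
  Victim.Sex = c("Male", "Male", "Unknown", "Female"),
  Relationship = c("Acquaintance", "Unknown", "Wife", "Stranger"),
  Weapon = c("Blunt Object", "Handgun", "Knife", "Unknown"),
  City = c("Anchorage", "Anchorage", "Juneau", "Juneau")
)

filtered_toy <- toy %>%
  select(Year, Victim.Age, Victim.Sex, Relationship, Weapon) %>%  # keep 5 columns
  filter(Victim.Sex != "Unknown",
         Relationship != "Unknown",
         Weapon != "Unknown",
         between(Victim.Age, 1, 100))                             # plausible ages only

filtered_toy
```

Only the first toy row survives: the others carry an unknown value or an age outside 1 to 100.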
| Year | n |
|---|---|
| 1980 | 14384 |
| 1981 | 14184 |
| 1982 | 13896 |
| 1983 | 13184 |
| 1984 | 12597 |
| 1985 | 12432 |
| 1986 | 12993 |
| 1987 | 12283 |
| 1988 | 11674 |
| 1989 | 12415 |
| 1990 | 12916 |
| 1991 | 12771 |
| 1992 | 12625 |
| 1993 | 12961 |
| 1994 | 12198 |
| 1995 | 11242 |
| 1996 | 9822 |
| 1997 | 9151 |
| 1998 | 8266 |
| 1999 | 7372 |
| 2000 | 7063 |
| 2001 | 7373 |
| 2002 | 7782 |
| 2003 | 7665 |
| 2004 | 7580 |
| 2005 | 7654 |
| 2006 | 7742 |
| 2007 | 7519 |
| 2008 | 6690 |
| 2009 | 7247 |
| 2010 | 6944 |
| 2011 | 6730 |
| 2012 | 6716 |
| 2013 | 6329 |
| 2014 | 6256 |
Now with our filtered data, we wanted to see how the number of cases changed over the years 1980-2014. You can see peaks in 1980 and 1993, after which the counts start to decrease. The high count in 1980 may be related to a severe global economic recession; US inflation peaked at 14.76% that year.
# The rate of murders during the period of 1980-2014
filtered_data %>%
group_by(Year) %>%
summarise(murder = n()) %>%
ggplot(aes(Year,murder)) + geom_point() + geom_smooth()
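To put a number on the question of how much homicides changed since 1980, the first and last rows of the filtered yearly table can be compared directly. The two counts below are copied from that table; the variable names are ours.

```r
# Yearly counts copied from the filtered table above (first and last year).
cases_1980 <- 14384
cases_2014 <- 6256

# Percent change relative to the 1980 count.
pct_change <- (cases_2014 - cases_1980) / cases_1980 * 100
round(pct_change, 1)
```

In the filtered data, the 2014 count is roughly 56.5% lower than the 1980 count.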
Then we wanted to see if the perpetrator knew their victim before striking, so we made a variable based on their relationship.
## Rows: 346,656
## Columns: 6
## $ Year <chr> "1980", "1980", "1980", "1980", "1980", "1980…
## $ Victim.Age <chr> "14", "43", "43", "30", "42", "99", "20", "36…
## $ Victim.Sex <chr> "Male", "Male", "Male", "Male", "Female", "Fe…
## $ Relationship <chr> "Acquaintance", "Acquaintance", "Acquaintance…
## $ Weapon <chr> "Blunt Object", "Strangulation", "Strangulati…
## $ Relationship_with_murder <chr> "Known", "Known", "Known", "Known", "Known", …
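A derived column like `Relationship_with_murder` can be built with a conditional inside `mutate()`. The sketch below assumes that a `Relationship` value of "Stranger" maps to "Unknown" and every other value maps to "Known"; the report does not show the actual rule, so treat the mapping, `toy`, and `relationship_toy` as hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the filtered data.
toy <- data.frame(
  Relationship = c("Acquaintance", "Stranger", "Wife", "Stranger", "Friend")
)

relationship_toy <- toy %>%
  mutate(Relationship_with_murder =
           if_else(Relationship == "Stranger", "Unknown", "Known"))  # assumed rule

relationship_toy %>% count(Relationship_with_murder)
```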
Looking at this table, it is much more likely that the victim knows their perpetrator than not. The difference is huge.
#Count how many cases they know each other
relationship_data %>% group_by(Relationship_with_murder) %>% summarise(cases = n())
## # A tibble: 2 × 2
## Relationship_with_murder cases
## <chr> <int>
## 1 Known 253368
## 2 Unknown 93288
In this pie chart, we wanted to see whether the victim knew their perpetrator or had some type of relationship with them, or whether they were simply strangers. The number of victims who knew the murderer is nearly three times the number of victims who were strangers to the perpetrators.
We wanted to count cases by victim sex and graph them using the filtered data. In our findings, a much higher share of victims are male than female.
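The shares behind the pie chart can be checked from the two counts in the table above. This sketch recomputes them and draws a basic pie with ggplot2; it is not the original plotting code, and `relationship_counts`, `known_to_unknown`, and `pie` are names we introduce for illustration.

```r
library(ggplot2)

# Counts copied from the relationship table above.
relationship_counts <- data.frame(
  Relationship_with_murder = c("Known", "Unknown"),
  cases = c(253368, 93288)
)

# Share of each group and the ratio of known to unknown cases.
relationship_counts$share <- relationship_counts$cases / sum(relationship_counts$cases)
known_to_unknown <- relationship_counts$cases[1] / relationship_counts$cases[2]

# A pie chart is a stacked bar drawn in polar coordinates.
pie <- ggplot(relationship_counts,
              aes(x = "", y = cases, fill = Relationship_with_murder)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()
```

About 73% of the filtered cases fall in the "Known" group, which is roughly 2.7 times the "Unknown" group.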
## Victim.Sex n
## 1 Female 89362
## 2 Male 257294
# Graph for Victim Sex
filtered_data %>% ggplot(aes(Victim.Sex, fill = Victim.Sex)) +
geom_bar(color = 'black') + theme_bw() +
geom_text(aes(label = after_stat(count)), stat = "count", vjust = 5) +
labs(title = "Which gender is the most targeted?",
x = "Victim Gender", y = "Cases", fill = "Victim Gender")
Since we made a table to show which gender is killed the most, we also wanted a bar chart to compare both genders. It helps show how much more likely a victim is to be male.
Then we wanted to see the relationship between the victim’s gender and whether they knew their perpetrator.
# How does the victim's gender relate to whether they knew the perpetrator?
relationship_data %>% ggplot(aes(Victim.Sex, fill = Relationship_with_murder)) +
geom_bar(color = 'black') +
theme_bw() +
# after_stat(count) replaces the deprecated ..count..; position_stack() centers each label in its bar segment
geom_text(aes(label = after_stat(count)), stat = "count", position = position_stack(vjust = 0.5)) +
labs(title = "How many cases do they know each other by genders?",
x = "Victim Gender", y = "Cases", fill = "Relationship with murder")
From the graph above, it is clear that most victims knew their murderers before the incident.
What does the distribution of cases look like by victim’s age?
filtered_data %>% ggplot(aes(Victim.Age)) +
geom_histogram(color = 'Black', fill = 'white', binwidth = 3) +
labs(x = "Victim Age", y = "Cases")
We can see that victims are most commonly aged 21-25. However, it is not really clear whether age is a factor affecting the rate of murder cases. So, let’s take a look at this flow.
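The most common age bracket and the average age are different summaries, and they need not agree. The sketch below shows the difference on a small, made-up vector of ages (not values from the dataset), using 5-year brackets as an assumption.

```r
# Hypothetical ages, for illustration only.
ages <- c(3, 19, 22, 24, 25, 31, 47, 68)

# Assign each age to a 5-year bracket: (0,5], (5,10], ..., (95,100].
brackets <- cut(ages, breaks = seq(0, 100, by = 5), right = TRUE)
bracket_counts <- table(brackets)

# The bracket with the most cases, and the plain average age.
most_common_bracket <- names(which.max(bracket_counts))
average_age <- mean(ages)
```

In this toy vector the most common bracket is `(20,25]`, while the average age is close to 30, because a few older victims pull the mean upward.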
To summarize, an important takeaway from our analysis is that perpetrators usually know their victims before committing murder. Male victims are more likely to be killed by someone they know than by a stranger, and the same holds for female victims. This is important to know in today’s world because we are more likely to be murdered by someone we know than by a stranger, even though we often feel safer with someone we know. Based on our findings, men are nearly 3x more likely to be murdered than women. The typical homicide victim is around 30 years old, although victims can be of any age. The perpetrator is usually between the ages of 21 and 25, while the victim profile is more common in older age brackets. Lastly, the most used weapon has been the handgun. It makes us wonder whether stricter gun laws in each state would reduce crime, since handguns are easy to acquire. As for the underlying factors and motives of a serial killer, each may have different motives, such as a desperate need for money, power, sex, etc., but what they have in common is that they are always ready to strike again.